Multigrain shared memory
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 197-203). By Donald Yeung.
Studying Directory Access Patterns via Reuse Distance Analysis and Evaluating Their Impact on Multi-Level Directory Caches
The trend for multicore CPUs is towards increasing core count. One of
the key limiters to scaling will be the on-chip directory cache. Our
work investigates moving portions of the directory away from the cores,
perhaps to off-chip DRAM, where ample capacity exists. While such
multi-level directory caches exhibit increased latency, several aspects
of directory accesses will shield CPU performance from the slower
directory, including low access frequency and latency hiding underneath
data accesses to main memory.
While multi-level directory caches have been studied previously, no work
has yet comprehensively quantified the directory access patterns
themselves, making it difficult to understand multi-level behavior in
depth. This paper presents a framework based on multicore reuse
distance for studying directory cache access patterns. Using our
analysis framework, we show that between 69% and 93% of directory entries are
looked up only once or twice during their lifetimes in the directory
cache, and that between 51% and 71% of dynamic directory accesses are latency
tolerant. Using cache simulations, we show that a very small L1 directory
cache can service 80% of latency critical directory lookups. Although a
significant number of directory lookups and eviction notifications must
access the slower L2 directory cache, virtually all of these are latency
tolerant.
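The per-entry lookup frequencies quantified above can be illustrated with a small sketch. The trace format and the helper below are hypothetical, and for simplicity it counts lookups over the whole trace rather than per directory-cache lifetime:

```python
from collections import Counter

def entry_lookup_histogram(directory_trace):
    """Map lookups-per-entry -> number of entries with that many lookups.
    `directory_trace` is a list of cache-line addresses, one per directory
    lookup. (Simplification: real lifetimes end at eviction; here we count
    across the whole trace.)"""
    lookups_per_entry = Counter(directory_trace)
    return Counter(lookups_per_entry.values())

# Illustrative trace: most entries are looked up only once or twice.
trace = [0x100, 0x140, 0x100, 0x180, 0x1C0, 0x200, 0x140]
hist = entry_lookup_histogram(trace)
# hist[k] = number of entries looked up exactly k times
```

A profiler built on this idea would additionally segment the trace at each entry's eviction, which is what the paper's lifetime-based measurement does.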
Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors
Heterogeneous microprocessors integrate CPUs and GPUs on the same chip,
providing fast CPU-GPU communication and enabling cores to compute on
data "in place." These advantages will permit integrated GPUs to exploit
a smaller unit of parallelism. But one challenge will be exposing
sufficient parallelism to keep all of the on-chip compute resources
fully utilized. In this paper, we argue that integrated CPU-GPU chips
should exploit parallelism from multiple loops simultaneously. One
example of this is nested parallelism in which one or more inner SIMD
loops are nested underneath a parallel outer (non-SIMD) loop. By
scheduling the parallel outer loop on multiple CPU cores, multiple
dynamic instances of the inner SIMD loops can be scheduled on the GPU
cores. This boosts GPU utilization and parallelizes the non-SIMD code.
Our preliminary results show exploiting such multi-loop parallelism
provides a 3.12x performance gain over exploiting parallelism from
individual loops one at a time.
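The nested-parallelism pattern the abstract describes can be sketched in plain Python, with a thread pool standing in for the CPU cores and a data-parallel helper standing in for the inner SIMD loops offloaded to the GPU. All names are illustrative; this is not the authors' implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def inner_simd(row):
    # Inner SIMD loop: the same operation applied across a whole row,
    # the unit of work that would run on the integrated GPU cores.
    return [x * x + 1.0 for x in row]

def multi_loop_parallel(data, workers=4):
    # Outer (non-SIMD) parallel loop: scheduled across CPU threads, so
    # multiple dynamic instances of the inner SIMD loop are in flight
    # at once, keeping the GPU-side resources utilized.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(inner_simd, data))

rows = [[float(i + j) for j in range(4)] for i in range(8)]
out = multi_loop_parallel(rows)
```

The key point is structural: each outer iteration launches an independent inner SIMD instance, which is what lets the scheme parallelize the non-SIMD code while boosting GPU utilization.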
Pipelined CPU-GPU Scheduling for Caches
Heterogeneous microprocessors integrate a CPU and GPU with a shared cache hierarchy on the same chip, affording low-overhead communication between the CPU and GPU's cores. Often, large array data structures are communicated from the CPU to the GPU and back. While the on-chip cache hierarchy can support such CPU-GPU producer-consumer sharing, this almost never happens due to poor temporal reuse. Because the data structures can be quite large, by the time the consumer reads the data, it has been evicted from the cache even though the producer brought it on-chip when it originally wrote the data. As a result, the CPU-GPU communication happens through main memory instead of the cache, hurting performance and energy.
This paper exploits the on-chip caches in a heterogeneous microprocessor to improve CPU-GPU communication efficiency. We divide streaming computations executed by the CPU and GPU that exhibit producer-consumer sharing into chunks, and overlap the execution of CPU chunks with GPU chunks in a software pipeline. To enforce data dependences, the producer executes one chunk ahead of the consumer at all times. We also propose a low-overhead synchronization mechanism in which the CPU directly controls thread-block scheduling in the GPU to maintain the producer's "run-ahead distance" relative to the consumer. By adjusting the chunk size or run-ahead distance, we can make the CPU-GPU working set fit in the last-level cache, thus permitting the producer-consumer sharing to occur through the LLC. We show through simulation that our technique reduces the number of DRAM accesses by 30.4%, improves performance by 26.8%, and lowers memory system energy by 27.4%, averaged across 7 benchmarks.
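Why chunking keeps the producer-consumer traffic in the LLC can be seen with a toy LRU cache model. The model and parameters below are illustrative, not the paper's simulator: with a one-chunk run-ahead distance and a two-chunk working set that fits in the cache, the consumer's reads all hit, so only the producer's compulsory misses remain:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal fully associative LRU cache model that counts misses
    (an illustrative stand-in for the shared LLC)."""
    def __init__(self, capacity):
        self.capacity, self.lines, self.misses = capacity, OrderedDict(), 0
    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # refresh LRU position
        else:
            self.misses += 1                  # would go to DRAM
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict LRU line

N, C, K = 1024, 256, 64   # array lines, LLC lines, chunk size (2*K <= C)

# Unpipelined: produce the whole array, then consume it. The array is
# larger than the LLC, so every consumer read misses.
llc = LRUCache(C)
for a in range(N): llc.access(a)   # producer writes
for a in range(N): llc.access(a)   # consumer reads: all evicted by now
unpipelined = llc.misses

# Pipelined: the producer stays one chunk ahead of the consumer, so the
# two-chunk working set stays resident and consumer reads hit in the LLC.
llc = LRUCache(C)
for c in range(N // K + 1):
    for a in range(c * K, min((c + 1) * K, N)):
        llc.access(a)              # produce chunk c
    if c > 0:
        for a in range((c - 1) * K, c * K):
            llc.access(a)          # consume chunk c-1: still resident
pipelined = llc.misses
```

In this toy model the pipelined schedule halves the miss (DRAM-access) count, which is the mechanism behind the DRAM-traffic reduction the paper reports.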
Symbiotic Cache Resizing for CMPs with Shared LLC
This paper investigates the problem of finding the optimal sizes of the private caches and the shared LLC in CMPs. Resizing private and shared caches in modern CMPs is one way to squeeze wasteful power consumption out of the architecture and improve power efficiency. However, shrinking each private/shared cache has a different impact on performance loss and power savings, because each cache contributes differently to performance and power. It is beneficial for both performance and power to shrink the LRU way of the private/shared cache that saves the most power and increases data traffic the least.
This paper presents Symbiotic Cache Resizing (SCR), a runtime technique that reduces the total power consumption of the on-chip cache hierarchy in CMPs with a shared LLC. SCR turns off private- and shared-cache ways in an inter-core and inter-level manner so that each disabled way yields the best power savings while maintaining high performance. SCR finds such optimal cache sizes by utilizing greedy algorithms that we develop in this study. In particular, Prioritized Way Selection picks the most power-inefficient way. LLC-Partitioning-aware Prioritized Way Selection finds optimal cache sizes from a multi-level perspective. Lastly, Weighted Threshold Throttling finds the optimal threshold per cache level. We evaluate SCR in two-core, four-core, and eight-core systems. Results show that SCR saves 13% power in the on-chip cache hierarchy and 4.2% power in the system compared to an even LLC partitioning technique. SCR saves 2.7x more power in the cache hierarchy than the state-of-the-art LLC resizing technique while achieving better performance.
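The greedy idea behind Prioritized Way Selection can be sketched as follows. The way names, power savings, and performance-loss numbers are entirely hypothetical, and the real SCR heuristics (LLC-partitioning awareness, per-level thresholds) are more involved:

```python
def prioritized_way_selection(ways, max_perf_loss):
    """Greedy sketch: each candidate way is (name, power_saved, perf_loss).
    Repeatedly disable the most power-inefficient enabled way (best
    power-saved per unit of performance lost) while the cumulative
    performance loss stays under the threshold."""
    enabled = list(ways)
    disabled, total_saved, total_loss = [], 0.0, 0.0
    while enabled:
        # Pick the way with the best saving-to-loss ratio.
        best = max(enabled, key=lambda w: w[1] / w[2])
        if total_loss + best[2] > max_perf_loss:
            break   # disabling anything more would hurt performance too much
        enabled.remove(best)
        disabled.append(best[0])
        total_saved += best[1]
        total_loss += best[2]
    return disabled, total_saved, total_loss

# Hypothetical candidates: (way, power saved, performance lost).
ways = [("core0.L1.way3", 0.8, 0.5), ("core1.L1.way3", 0.7, 0.6),
        ("LLC.way7", 1.5, 0.4), ("LLC.way6", 1.2, 1.0)]
off, saved, loss = prioritized_way_selection(ways, max_perf_loss=1.5)
```

Note the inter-level flavor: private-cache and LLC ways compete in the same priority queue, so the algorithm naturally disables whichever level is currently the most power-inefficient.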
Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis
Researchers have proposed numerous directory techniques to address multicore scalability whose behavior depends on the CPU's particular configuration, e.g. core count and cache size. As CPUs continue to scale, it is essential to explore the directory's architecture dependences. However, this is challenging using detailed simulation given the large number of CPU configurations that are possible. This paper proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel LRU stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. We also implement our framework in a profiler, and apply it to gain insights into multicore scaling's impact on the directory. Our profiling results show that directory accesses reduce by 3.5x across data cache size scaling, suggesting that techniques that trade off access latency for reduced capacity or conflicts become increasingly effective as cache size scales. We also show the portion of on-chip memory devoted to the directory cache can be reduced by 53.3% across data cache size scaling, thus lowering the over-provisioning needed at large cache sizes. Finally, we validate our RD-based directory analyses, and find they are within 13% of cache simulations in terms of access count, on average.
Parallelization of the SSCA#3 Benchmark on the RAW Processor
The MIT Raw machine provides a point-to-point interconnection network for transferring register values between tiles. The programmer must schedule the network communication for each tile by hand and guarantee its correctness, and it is not easy to parallelize benchmarks by hand for all possible tile configurations on the Raw processor. To overcome this problem, we develop a communication library and a switch code generator to create the switch code for each tile automatically. We implement our techniques for the SSCA#3 (SAR Sensor Processing, Knowledge Formation) benchmark, and evaluate the parallelism on a physical Raw processor. The experimental results show the SSCA#3 benchmark has dense matrix operations with abundant parallelism. Using 16 tiles, the 'SAR image formation' procedure achieves a speedup of 13.86, and the speedup of the 'object detection' procedure is 9.98.
Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
Performance on multicore processors is determined largely by on-chip
cache. Computer architects have conducted numerous studies in the past
that vary core count and cache capacity as well as problem size to
understand impact on cache behavior. These studies are very costly due
to the combinatorial design spaces they must explore.
Reuse distance (RD) analysis can help architects explore multicore cache
performance more efficiently. One problem, however, is multicore RD
analysis requires measuring concurrent reuse distance (CRD) profiles
across thread-interleaved memory reference streams. Sensitivity to
memory interleaving makes CRD profiles architecture dependent,
undermining RD analysis benefits. But for parallel programs with
symmetric threads, CRD profiles vary with architecture tractably: they
change only slightly with cache capacity scaling, and shift predictably
to larger CRD values with core count scaling. This enables analysis of a
large number of multicore configurations from a small set of measured
CRD profiles.
This paper investigates using RD analysis to efficiently analyze
multicore cache performance for parallel programs, making several
contributions. First, we characterize how CRD profiles change with core
count and cache capacity. One of our findings is that core count scaling
degrades locality, but the degradation only impacts last-level caches
(LLCs) below 16MB for our benchmarks and problem sizes, increasing to
128MB if problem size scales by 64x. Second, we apply reference groups
to predict CRD profiles across core count scaling, and evaluate
prediction accuracy. Finally, we use CRD profiles to analyze multicore
cache performance. We find predicted CRD profiles can estimate LLC MPKI
within 76% of simulation for configurations without pathological cache
conflicts, in 1/1200th the time needed to simulate the full design
space.
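The primitive underlying these analyses, the LRU stack (reuse) distance, can be sketched directly; for a concurrent reuse distance (CRD) profile the input is simply the thread-interleaved reference stream. This is a simplified quadratic-time illustration, not the profiler's implementation:

```python
def reuse_distance_profile(trace):
    """Map reuse distance -> reference count for a memory trace.
    Distance = number of distinct addresses touched since the last
    access to the same address; first accesses are cold ('inf')."""
    stack, profile = [], {}          # stack: LRU order, most recent last
    for addr in trace:
        if addr in stack:
            d = len(stack) - 1 - stack.index(addr)
            stack.remove(addr)
        else:
            d = float("inf")         # compulsory (cold) reference
        stack.append(addr)
        profile[d] = profile.get(d, 0) + 1
    return profile

def misses(profile, capacity):
    # References with distance >= capacity miss in a fully
    # associative LRU cache of that capacity.
    return sum(n for d, n in profile.items() if d >= capacity)

# Two symmetric threads interleaved round-robin (illustrative trace):
crd = reuse_distance_profile(["a1", "b1", "a1", "b1", "a2", "b2"])
```

Interleaving is what makes CRD larger than per-thread RD: each thread's reuses are stretched by the distinct addresses the other threads touch in between, which is why core count scaling shifts the profile toward larger distances.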
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.